in_process_exporter_metrics: Expire metrics for dead processes#11760
in_process_exporter_metrics: Expire metrics for dead processes#11760piwai wants to merge 6 commits intofluent:masterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRecords composite identity keys for active processes and threads during /proc scans and purges stale cmetric entries whose label-derived keys are not present in the active sets. Purging runs only when active-index tracking completes; active tables are freed on teardown. Changes
Sequence Diagram(s)sequenceDiagram
participant Scanner as Process Scanner
participant ThreadUpd as Thread Updater
participant Cmetrics as Cmetrics Manager
participant Logger as Logger
rect rgba(200,200,255,0.5)
Scanner->>Scanner: scan /proc, build active_pids (name|pid, name|pid|ppid)
end
rect rgba(200,255,200,0.5)
Scanner->>ThreadUpd: call process_thread_update(..., active_tids, &active_index_complete)
ThreadUpd->>ThreadUpd: discover threads, insert name|threadname|tid into active_tids
end
rect rgba(255,200,200,0.5)
Scanner->>Cmetrics: call purge_stale_metrics(active_pids, active_tids, active_index_complete)
alt active_index_complete == true
Cmetrics->>Cmetrics: iterate metrics, extract label values, compare to active sets
Cmetrics->>Cmetrics: remove stale metric entries
Cmetrics->>Logger: log removals
else active_index_complete == false
Cmetrics->>Logger: emit warning and skip purge
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
|
Here's the request valgrind log |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99f12369f0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 946-967: The purge_stale_metrics function currently checks only
PID/TID via get_metric_label_value against active_ids, which allows stale series
to survive on ID reuse; update purge_stale_metrics to build a collision-free key
for each metric (for example by concatenating the full label tuple or PID/TID
plus process start time) and use that key when querying flb_hash_table_get
instead of the single id_val so the exact emitted identity is checked before
calling cmt_map_metric_destroy; apply the same change to the equivalent logic
referenced around the other purge block (the similar code at the later section)
so both places use the full-label or start-time-based key.
- Line 524: flb_hash_table_add calls that populate
active_tids/active_pids/active_fds must not be ignored — check each
flb_hash_table_add return value and treat a failure as a fatal error for this
collection pass: on add failure, log a descriptive error (including the
tid/pid/fd value), set a flag like index_complete = false (or return an error
code) and skip the end-of-pass purge; alternatively return early from the
collection function so purge code does not run. Update every insertion site that
uses flb_hash_table_add (references: active_tids, active_pids, active_fds) to
perform this check and ensure the purge path is gated by index_complete or
skipped when an insertion failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4c226035-2a3c-4f70-b7f7-44e2c70a3119
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
|
@piwai thanks for this contribution. Pls check the reviews provided by the Ai agent |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
plugins/in_process_exporter_metrics/pe_process.c (1)
526-533:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
name:idis still not a collision-free liveness key.The purge now matches on
name + pid/tid, but several maps distinguish series by more than that (ppidfor process series,threadnamefor thread series). If an ID is reused for another process/thread with the same name before the next scrape, the lookup still hits and the oldppid/threadnameseries survives. That means stale series can still accumulate under reuse-heavy workloads.Please purge against the exact emitted identity for each map, or another collision-free identity such as start time.
Also applies to: 955-983, 1099-1106, 1267-1286
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_process_exporter_metrics/pe_process.c` around lines 526 - 533, The current liveness key built in the active_tids purge (created via flb_sds_create(name) + ":" + tid_str and added with flb_hash_table_add(active_tids, active_key, ...)) is not collision-free because it omits other identity fields (e.g., ppid for process maps, threadname for thread maps); update the key construction wherever you see this pattern (the active_key + flb_hash_table_add calls in this file) to include the exact emitted identity used when the series is created—either append ppid or threadname (or the process start time) to the key so the purge matches the full emitted identity, and apply the same fix to the other similar blocks that build active_key before calling flb_hash_table_add.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 1244-1247: When process_thread_update(ctx, ts, pid_str, name,
active_tids, &active_index_complete) returns -1 you must mark the active-index
gate invalid so the purge doesn't remove live thread metrics; update the caller
to set active_index_complete = false (or otherwise invalidate the purge gate)
immediately in the error branch where ret == -1. Apply the same change to the
other failing path referenced (the similar block around the 1265-1289 range) so
both failure paths for process_thread_update() invalidate active_index_complete
when thread enumeration fails.
---
Duplicate comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 526-533: The current liveness key built in the active_tids purge
(created via flb_sds_create(name) + ":" + tid_str and added with
flb_hash_table_add(active_tids, active_key, ...)) is not collision-free because
it omits other identity fields (e.g., ppid for process maps, threadname for
thread maps); update the key construction wherever you see this pattern (the
active_key + flb_hash_table_add calls in this file) to include the exact emitted
identity used when the series is created—either append ppid or threadname (or
the process start time) to the key so the purge matches the full emitted
identity, and apply the same fix to the other similar blocks that build
active_key before calling flb_hash_table_add.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6fb43019-e2d2-4ca2-b95c-1fa47e7631c9
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 526-533: The SDS key construction code using flb_sds_create and
flb_sds_cat_safe must check for allocation/appending failure before calling
flb_sds_len or flb_hash_table_add; update the block that builds active_key to
validate that flb_sds_create returned non-NULL and that each flb_sds_cat_safe
succeeded, and if any step fails set *active_index_complete = FLB_FALSE, avoid
calling flb_sds_len or flb_hash_table_add with a NULL key, free/destroy any
partially created SDS via flb_sds_destroy if non-NULL, and return/continue as
appropriate; apply the same defensive checks to the other SDS key construction
sites (the analogous blocks around lines 976-983 and 1099-1106) to prevent NULL
dereference in flb_sds_len or hash operations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fbc9fa0c-f17d-49f5-8780-04c410d91e63
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/in_process_exporter_metrics/pe_process.c (1)
1151-1158:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInvalidate the purge index when
ppidparsing fails.By this point the
name|pidkey has already been inserted, but thename|pid|ppidkey has not. If Line 1153 or Line 1158 jumps to cleanup,active_pidsis partial whileactive_index_completestays true, so the final purge can delete live metrics for every map keyed by{name,pid,ppid}.Suggested fix
tmp = strstr(entry->str, ")"); if (tmp == NULL) { + active_index_complete = FLB_FALSE; goto cleanup; } ret = flb_slist_split_string(&split_list, tmp+2, ' ', -1); if (ret == -1) { + active_index_complete = FLB_FALSE; goto cleanup; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_process_exporter_metrics/pe_process.c` around lines 1151 - 1158, Parsing failures for ppid (when tmp == NULL or flb_slist_split_string returns -1) currently jump to cleanup without invalidating the purge index, leaving active_pids partial while active_index_complete remains true; update the error paths in the ppid parsing block inside pe_process.c (the section using entry->str, tmp, and flb_slist_split_string(&split_list,...)) so that before jumping to the cleanup label you set active_index_complete = false (and optionally clear any partial entries in active_pids if present) to prevent premature purging of live {name,pid,ppid}-keyed metrics.
♻️ Duplicate comments (2)
plugins/in_process_exporter_metrics/pe_process.c (2)
1301-1304:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInvalidate the purge gate when thread scanning aborts.
If
process_thread_update()returns-1,active_tidsis incomplete for that process butactive_index_completeremains true. A transient/proc/<pid>/taskfailure can then make the end-of-pass purge remove live thread metrics.Suggested fix
ret = process_thread_update(ctx, ts, pid_str, name, active_tids, &active_index_complete); if (ret == -1) { + active_index_complete = FLB_FALSE; flb_plg_debug(ctx->ins, "collect thread procfs is failed on the pid = %s", pid_str); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_process_exporter_metrics/pe_process.c` around lines 1301 - 1304, When process_thread_update(ctx, ts, pid_str, name, active_tids, &active_index_complete) returns -1, mark the purge gate as invalid so incomplete active_tids cannot trigger end-of-pass purges; specifically, in the error branch where ret == -1 set active_index_complete = false (or otherwise invalidate the purge condition) for that pid and ensure subsequent logic treats that process as incomplete so live thread metrics are not purged prematurely.
544-555:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe
|-joined identity keys are still collision-prone.These tables now mirror the same plain-text encoding on insert and purge, but the encoding is not injective:
active_pidsmixesname|pidandname|pid|ppid, and process/thread names can legally contain|. Distinct label tuples can therefore collapse to the same hash key, which keeps stale series alive or purges the wrong series. Please switch both builders and lookups to a collision-free encoding instead of delimiter-joined strings.Also applies to: 970-1016, 1131-1145, 1170-1185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/in_process_exporter_metrics/pe_process.c` around lines 544 - 555, The current "|"‑joined keys (built via flb_sds_create + flb_sds_cat_safe and inserted with flb_hash_table_add) are not injective because names can contain "|" and some tables mix different tuple arities; replace this with a collision-free encoding (e.g., length-prefixed fields or a binary composite key) for both builders and lookups: change the code that builds active_key (references: active_key, flb_sds_create, flb_sds_cat_safe) and the hash table inserts/queries (references: flb_hash_table_add, active_tids, active_pids) to produce a unique binary key (field lengths + data or a fixed-struct key) and update all corresponding lookups/purges (also modify the same pattern at the other occurrences noted) so key creation and key comparison are consistent and no delimiter ambiguity remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 544-554: The key-construction error path leaks the buffer returned
by get_name() (thread_name) because the early continue skips cleanup; modify the
failure branch in the active_key build (the if that sets *active_index_complete
= FLB_FALSE) to free thread_name (and tid_str if it was allocated similarly)
before calling flb_sds_destroy(active_key) and continue. Locate the code around
active_key creation/concatenation (the flb_sds_create and flb_sds_cat_safe
calls) and insert the appropriate deallocation (e.g., flb_free(thread_name) or
free(thread_name) consistent with how get_name() allocates) just prior to the
continue so no leak remains.
---
Outside diff comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 1151-1158: Parsing failures for ppid (when tmp == NULL or
flb_slist_split_string returns -1) currently jump to cleanup without
invalidating the purge index, leaving active_pids partial while
active_index_complete remains true; update the error paths in the ppid parsing
block inside pe_process.c (the section using entry->str, tmp, and
flb_slist_split_string(&split_list,...)) so that before jumping to the cleanup
label you set active_index_complete = false (and optionally clear any partial
entries in active_pids if present) to prevent premature purging of live
{name,pid,ppid}-keyed metrics.
---
Duplicate comments:
In `@plugins/in_process_exporter_metrics/pe_process.c`:
- Around line 1301-1304: When process_thread_update(ctx, ts, pid_str, name,
active_tids, &active_index_complete) returns -1, mark the purge gate as invalid
so incomplete active_tids cannot trigger end-of-pass purges; specifically, in
the error branch where ret == -1 set active_index_complete = false (or otherwise
invalidate the purge condition) for that pid and ensure subsequent logic treats
that process as incomplete so live thread metrics are not purged prematurely.
- Around line 544-555: The current "|"‑joined keys (built via flb_sds_create +
flb_sds_cat_safe and inserted with flb_hash_table_add) are not injective because
names can contain "|" and some tables mix different tuple arities; replace this
with a collision-free encoding (e.g., length-prefixed fields or a binary
composite key) for both builders and lookups: change the code that builds
active_key (references: active_key, flb_sds_create, flb_sds_cat_safe) and the
hash table inserts/queries (references: flb_hash_table_add, active_tids,
active_pids) to produce a unique binary key (field lengths + data or a
fixed-struct key) and update all corresponding lookups/purges (also modify the
same pattern at the other occurrences noted) so key creation and key comparison
are consistent and no delimiter ambiguity remains.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4acaf5ca-5287-46fe-bffe-a9a850369e8c
📒 Files selected for processing (1)
plugins/in_process_exporter_metrics/pe_process.c
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
Signed-off-by: Pierre-Yves Rofes <3604235+piwai@users.noreply.github.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
@edsiper Took some time, but I think I addressed all issues! For the hash table key construction using "|" characters, I consider that the proposed fix with fixed-length and NUL bytes would add extra complexity for very little benefit, as the collision is very unlikely and even if it happens, it will be purged eventually in case of high load.
|
There was a problem hiding this comment.
This is really interesting and prominent idea!
But we don't want to touch the internal structure of cmetrics from Fluent Bit's plugins as much as possible. So, could you try to implement this kind of feature inside of cmetrics if possible?
We're also seeking an opportunity to purge stale metrics in some ways but currently not succeeded yet.
|
@cosmo0920 thanks a lot for the review! |
Yup, implementing cmt_expire() or similar function to provide expiring metrics.
We're also suffering for out of date issue of stale metrics in out_prometheus_remote_write or out_prometheus_expoter. |
This add an expiration mechanism for dead processes, avoiding publishing prometheus metrics for process which are not running anymore on the host.
Uses 2 hash tables to store process and threads id, to be able to expire metrics which are not present in these tables.
Fixes #9547
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Improvements